Abstract



In the equating literature, a recurring concern is that equating functions that utilize a single 
anchor to account for examinee groups’ nonequivalence are biased when the groups are 
extremely different and/or when the anchor only weakly measures what the tests measure. 
Several proposals have been made to address this equating bias by incorporating more than one 
anchor into nonequivalent groups with anchor test (NEAT) equating functions. These proposals 
have not been extensively considered or comparatively evaluated. This study evaluates three 
methods for incorporating more than one anchor into NEAT equating functions, including 
poststratification, imputation, and propensity score matching. The three methods are studied and 
compared in two examples. The implications for using the three equating approaches in practice 
and for developing alternative strategies to incorporate two anchors are discussed. 

Key words: NEAT equating, multiple anchors, psychometrics, standardized tests 
Background

Nonequivalent groups with anchor test (NEAT) equating methods are traditionally based 
on using a single anchor to account for examinee group differences (Braun & Holland, 1982; 
Kolen & Brennan, 2004; von Davier, Holland, & Thayer, 2004). These equating methods can be 
extended so that more than one anchor is incorporated. NEAT equating methods based on 
multiple anchors are potentially useful when the tests being equated measure such broad content 
that a single anchor may not reflect them, and/or when the examinee group differences are so 
large that the use of a single anchor to estimate these differences may produce biased equating 
results (Angoff, 1984; Livingston, 2004; Lord, 1960). 

Suggestions have been made for how to incorporate more than one anchor into NEAT 
equating (Angoff, 1984; Kolen, 1990; Liou, Cheng, & Li, 2001; Livingston, Dorans, & Wright, 
1990; Skaggs, 1990). These suggestions are fairly diverse and have included direct extensions of 
traditional single anchor equating methods and more elaborate propensity score matching and 
missing data imputation methods. Most of these suggestions have not been extensively 
researched or compared. The purpose of this paper is to develop and compare three proposed 
approaches for using two anchors to equate tests taken by nonequivalent examinee groups. 

This paper begins by describing the traditional NEAT data collection design where a 
single anchor is administered to both groups and the extension of this design to the situation 
where two anchors are administered. The assumptions for equating with one and two anchors are 
described primarily in terms of the poststratification NEAT equating method (von Davier et al., 
2004). Poststratification equating provides a useful basis for understanding the three approaches 
of interest, including two-anchor poststratification (Angoff, 1984), missing data imputation (Liou 
et al., 2001), and propensity score matching (Livingston, Dorans, & Wright, 1990). After being 
described, these three approaches are applied in two equating situations where the use of two 
anchors is expected to produce improved equating results (i.e., equating when anchors do not 
perfectly reflect the tests and equating when there are large examinee group differences). The 
final discussion focuses on the implications for using the three equating approaches in practice 
and for developing alternative strategies to incorporate two anchors into an equating function. 
One- and Two-Anchor Nonequivalent Groups With Anchor Test (NEAT) Data Collection 
Designs and Equating 

Data collection and equating using one anchor. For the traditional NEAT design 
(Table 1), the data are collected as two samples from nonequivalent populations (P and Q) that 
take different tests (X or Y) and the same anchor (A). The goal of equating is to produce a 
conversion from the scores of X to the scores of Y that eliminates the test forms’ difficulty 
differences. The equating conversion must account for how examinee group differences 
influence the test scores. One way to address examinee group differences is to use the groups’ A 
scores to estimate the X and Y distributions for a hypothetical single population, T, that is a 
synthetic mixture of P and Q, 

$$T = wP + (1 - w)Q, \qquad 0 \le w \le 1. \tag{1}$$

When X and Y data are available for population T, the X-to-Y equating function can be computed. 
This equating approach is poststratification equating using a single anchor, A. 

Table 1

The One-Anchor Nonequivalent Groups With Anchor Test (NEAT) Design

                 New test (X)   Anchor (A)   Old test (Y)
New group (P)         ✓             ✓
Old group (Q)                       ✓             ✓





The X and Y distributions in population T can be obtained through estimating T’s 
bivariate (X, A) and (Y, A) distributions using the observed data (Table 1) and making 
assumptions about the unobserved data. For poststratification equating, the (X, A) probability 
distribution in T, Prob_T(X, A), can be estimated as 

$$\mathrm{Prob}_T(X, A) = w\,\mathrm{Prob}_P(X, A) + (1 - w)\,\mathrm{Prob}_Q(X, A). \tag{2}$$

In Equation 2, Prob_P(X, A) is the joint (X, A) probability distribution observed in examinee 
group P. Prob_Q(X, A) is the joint (X, A) probability distribution estimated for examinee group Q 
by assuming that the “X-given-A” conditional probabilities observed in examinee group P, 
Prob_P(X | A), are group invariant and can be used to predict Q’s joint (X, A) probability 
distribution based on Q’s A distribution, 

$$\mathrm{Prob}_Q(X, A) = \mathrm{Prob}_P(X \mid A)\,\mathrm{Prob}_Q(A) = \frac{\mathrm{Prob}_P(X, A)}{\mathrm{Prob}_P(A)}\,\mathrm{Prob}_Q(A). \tag{3}$$
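The poststratification steps in Equations 1 through 3 can be sketched numerically. The following is a minimal illustration rather than the study's implementation: it assumes discrete score distributions stored as NumPy arrays, and the function name, argument names, and toy probabilities are hypothetical.

```python
import numpy as np

def poststratify_one_anchor(joint_xa_p, marg_a_q, w=0.5):
    """Estimate Prob_T(X) from P's joint (X, A) table and Q's anchor marginal.

    Rows of joint_xa_p index X scores; columns index anchor (A) scores.
    """
    joint_xa_p = np.asarray(joint_xa_p, dtype=float)
    marg_a_q = np.asarray(marg_a_q, dtype=float)
    marg_a_p = joint_xa_p.sum(axis=0)  # Prob_P(A)
    # Prob_P(X | A); anchor scores never observed in P are left at zero.
    cond_x_given_a = np.divide(
        joint_xa_p, marg_a_p,
        out=np.zeros_like(joint_xa_p), where=marg_a_p > 0,
    )
    joint_xa_q = cond_x_given_a * marg_a_q              # Equation 3
    joint_xa_t = w * joint_xa_p + (1 - w) * joint_xa_q  # Equation 2
    return joint_xa_t.sum(axis=1)                       # Prob_T(X)

# Toy data (hypothetical): two X scores and two anchor scores.
prob_t_x = poststratify_one_anchor(
    joint_xa_p=[[0.2, 0.1], [0.1, 0.6]],
    marg_a_q=[0.5, 0.5],
    w=0.5,
)
```

The Y distribution in T would be obtained the same way with the roles of P and Q reversed, using Q's observed (Y, A) table and P's anchor marginal.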



Data collection and equating using two anchors. The interest of this study is in 
extending and comparing approaches such as Equations 2 and 3 to the situation where the P and 
Q groups’ data are collected for two anchors, A1 and A2 (Table 2). The goal of equating in this 
situation is the same as for the one-anchor situation: to produce a conversion from the scores of 
X to the scores of Y that eliminates the test forms’ difficulty differences. To use two anchors to 
account for examinee groups’ influences on test scores, population T’s trivariate distributions 
can be estimated by extending Equation 2, 

$$\mathrm{Prob}_T(X, A_1, A_2) = w\,\mathrm{Prob}_P(X, A_1, A_2) + (1 - w)\,\mathrm{Prob}_Q(X, A_1, A_2), \tag{4}$$

by using the observed Prob_P(X, A1, A2) and making group invariance assumptions that extend 
Equation 3, 

$$\mathrm{Prob}_Q(X, A_1, A_2) = \mathrm{Prob}_P(X \mid A_1, A_2)\,\mathrm{Prob}_Q(A_1, A_2) = \frac{\mathrm{Prob}_P(X, A_1, A_2)}{\mathrm{Prob}_P(A_1, A_2)}\,\mathrm{Prob}_Q(A_1, A_2). \tag{5}$$
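As a worked consequence of Equations 4 and 5, summing the trivariate distribution over both anchors gives the marginal X distribution in T that is used for equating. The Y expression below is the mirror image, under the assumption (stated later in the text) that the conditional relationship observed in Q is applied to P's anchor distribution.

```latex
\begin{align*}
\mathrm{Prob}_T(X)
  &= \sum_{A_1, A_2} \mathrm{Prob}_T(X, A_1, A_2) \\
  &= w\,\mathrm{Prob}_P(X)
   + (1 - w) \sum_{A_1, A_2} \mathrm{Prob}_P(X \mid A_1, A_2)\,\mathrm{Prob}_Q(A_1, A_2), \\
\mathrm{Prob}_T(Y)
  &= w \sum_{A_1, A_2} \mathrm{Prob}_Q(Y \mid A_1, A_2)\,\mathrm{Prob}_P(A_1, A_2)
   + (1 - w)\,\mathrm{Prob}_Q(Y).
\end{align*}
```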



Table 2

The Two-Anchor Nonequivalent Groups With Anchor Test (NEAT) Design

                 New test (X)   Anchor (A1)   Anchor (A2)   Old test (Y)
New group (P)         ✓              ✓             ✓
Old group (Q)                        ✓             ✓             ✓





Two-anchor equating approaches based on Equations 4 and 5 may result in improved 
equating results relative to one-anchor poststratification equating based on Equations 2 and 3. 
One-anchor equating results based on Equations 2 and 3 are known to be inaccurate when 
test-anchor correlations are weak (Livingston, 2004). The two anchors are likely to be more highly 
correlated with the test scores than one anchor, meaning that the use of two anchors should give 
a more accurate account of how examinee group differences influence test scores than one 
anchor. Equating research has shown that equating functions based on the two-anchor group 
invariance assumption in Equation 5 can be more accurate than equating functions based on the 
one-anchor group invariance assumption in Equation 3 (Dorans, Liu, & Hammond, 2008). The 
next section describes three approaches for incorporating two anchors as in Equations 4 and 5. 

Three Equating Approaches Involving Two Anchors in the Nonequivalent Groups With 
Anchor Test (NEAT) Design 

The three equating approaches proposed for utilizing two anchors in NEAT equating are 
poststratification (Angoff, 1984), imputation (Liou et al., 2001), and propensity score matching 
(Livingston et al., 1990). All three approaches are based on similar assumptions: that the X and Y 
distributions that are not directly observed in P or Q can be estimated using P and Q’s A1 and A2 
scores and the conditional relationships observed in Q and P. All three approaches can be used to 
implement the major steps of observed score equating (von Davier et al., 2004), including 
presmoothing, estimating the X and Y distributions for synthetic group T, continuizing the X and 
Y distributions, computing linear and curvilinear equating functions, and assessing the equating 
functions with respect to their standard errors. The general characteristics of the 
poststratification, imputation, and propensity score matching approaches are described in the 
following section. Additional details of the approaches are described in the Appendix and in von 
Davier et al. 

Poststratification. The two-anchor poststratification equating method builds directly on 
the one-anchor poststratification method (Angoff, 1984; Lord, 1975). This approach extends the 
one-anchor poststratification method (von Davier et al., 2004) by applying loglinear models to 
presmooth P’s trivariate (X, A1, A2) distribution and Q’s trivariate (Y, A1, A2) distribution, 
computing T’s X and Y distributions from the presmoothed distributions using Equations 4 and 
5, and computing linear and curvilinear X-to-Y equating functions and their standard errors based 
on T’s X and Y distributions. Standard errors can be estimated for the differences between linear 
and curvilinear equating functions. In contrast to the imputation and propensity score matching 
approaches, standard errors can also be estimated for the differences between two-anchor and 
one-anchor equating functions (see Appendix). 







Imputation. The application of missing data imputation (Little & Rubin, 1987) to the 
computation of synthetic population equating functions was considered by Liou and Cheng 
(1995) and Liou et al. (2001). In this imputation approach, population T’s X and Y distributions 
are estimated by the imputation of population Q’s missing X data and population P’s missing Y 
data. The assumption of the imputation is that, given the anchor scores, the missing X data in Q 
and the missing Y data in P are missing at random and therefore imputable, based on the anchor 
scores’ observed relationships with the tests (Equations 3 and 5). To impute the missing data, 
Liou et al. modified Holland and Thayer’s (1987, 2000) loglinear presmoothing algorithm so that 
Equations 4 and 5 are used to repeatedly compute expectations of the missing data in an iterative 
expectation maximization (EM) algorithm. When the EM algorithm converges, population T’s X 
and Y distributions can be computed from T’s imputed and loglinear presmoothed (X, A1, A2) 
and (Y, A1, A2) distributions. The imputed X and Y distributions for population T imply a single 
group design, meaning that X-to-Y curvilinear and linear equating functions and their standard 
errors can be computed as single group equating functions (von Davier et al., 2004). 

Propensity score matching. The application of propensity score matching to equating 
was suggested in Livingston et al. (1990). Rather than use the two anchors in their original form, 
a single variable (i.e., propensity score) is constructed as the weighted combination of A1 and A2 
that maximally predicts membership in the examinee administration groups. For example, a 
logistic regression that predicts membership in P can be estimated for all P and Q examinees’ 
data based on examinees’ anchor scores, 

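$$\mathrm{Propensity}(P \mid A_1, A_2) = \frac{\exp(\beta_0 + \beta_1 A_1 + \beta_2 A_2)}{1 + \exp(\beta_0 + \beta_1 A_1 + \beta_2 A_2)}. \tag{6}$$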

Examinees from P and Q who have the same Propensity(P | A1, A2) scores are considered 
equivalent (i.e., matched). Alternative parameterizations of Equation 6 could be used, and 
Equation 6 can be extended to a large number of anchors and matching variables. 

To apply propensity score matching to this study’s equating context, the recommended 
propensity score matching approach from Rosenbaum and Rubin (1984) is followed. In 
Rosenbaum and Rubin’s (1984) proposal, categories of P and Q’s Propensity(P | A1, A2) scores 
are formed based on the percentiles of the Propensity(P | A1, A2) scores and the P and Q 
examinees who fall into the same category are considered equivalent. In the current study, the 
categorized propensity scores are used as a single anchor for estimating T’s X and Y distributions 
and equating X-to-Y as in traditional one-anchor poststratification equating, that is, substituting 
the categorized Propensity(P | A1, A2) scores for A in Equations 2 and 3. Other propensity score 
matching approaches developed for nonrandomized medical studies propose the use of the 
uncategorized propensity scores for drawing a small number of individuals from a large control 
group to match each individual from a small treatment group (Rosenbaum & Rubin, 1985; Rubin 
& Thomas, 1996; Rubin & Thomas, 2000). The use of categorized propensity scores was 
followed rather than other propensity score matching approaches because the categorized 
propensity scores allow for using all available examinee data (not drawing samples from either P 
or Q) and for defining the equating group of interest as a weighted, synthetic mixture of P and 
Q’s data (i.e., Equations 1, 2, and 4 where w does not have to be set to 0 or 1). 
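The construction of categorized propensity scores described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the study's code: the anchor scores are simulated (standardized, with group Q shifted upward by a hypothetical 0.6 SD), the logistic regression is fit with a plain Newton-Raphson loop, and the function names `fit_propensity` and `categorize` are invented for the example.

```python
import numpy as np

def fit_propensity(anchors, in_p, n_iter=25):
    """Fit a logistic regression of P membership on anchor scores
    (Newton-Raphson) and return each examinee's fitted propensity."""
    design = np.column_stack([np.ones(len(anchors)), anchors])
    beta = np.zeros(design.shape[1])
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-(design @ beta)))
        gradient = design.T @ (in_p - prob)
        hessian = design.T @ (design * (prob * (1.0 - prob))[:, None])
        beta += np.linalg.solve(hessian, gradient)
    return 1.0 / (1.0 + np.exp(-(design @ beta)))

def categorize(propensity, n_categories=10):
    """Assign 0-based categories using percentile cut points of the scores."""
    cuts = np.quantile(propensity, np.linspace(0.0, 1.0, n_categories + 1)[1:-1])
    return np.searchsorted(cuts, propensity, side="right")

# Simulated data (hypothetical): two correlated anchors, 500 examinees per group.
rng = np.random.default_rng(0)
n_per_group = 500
cov = [[1.0, 0.75], [0.75, 1.0]]
anchors_p = rng.multivariate_normal([0.0, 0.0], cov, n_per_group)
anchors_q = rng.multivariate_normal([0.6, 0.6], cov, n_per_group)
anchors = np.vstack([anchors_p, anchors_q])
in_p = np.concatenate([np.ones(n_per_group), np.zeros(n_per_group)])

propensity = fit_propensity(anchors, in_p)
category = categorize(propensity)  # plays the role of the single anchor A
```

The resulting `category` variable would then take the place of A in Equations 2 and 3, so that all P and Q examinees contribute to the synthetic population rather than a matched subsample.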

This Study 

The discussion from the previous section shows that the poststratification, imputation, 
and propensity score matching approaches can all be used to incorporate two anchors to estimate 
a synthetic population’s equating function. Perhaps the two-anchor results of the three 
approaches will be similar, but this has not been extensively considered in prior work. Some 
research has shown that for situations involving one anchor, poststratification and imputation can 
produce similar results (Liou & Cheng, 1995). Other work has shown that imputation based on 
one anchor and one demographic variable can produce results that are similar to those of 
unsmoothed poststratification equating based on one anchor (Liou et al., 2001). Applying the 
poststratification and imputation approaches to situations involving two anchors should be useful 
for determining if these approaches’ similarities hold when they are based on the same 
presmoothing models and when the approaches are used to compare two-anchor curvilinear and 
linear functions. 

The evaluation of the application of propensity score matching to two-anchor equating 
functions has been less researched and is a more exploratory approach at this point than the 
poststratification and imputation approaches. It would seem that the estimation and 
categorization of the propensity scores would introduce inaccuracy into the results. These 
potential inaccuracies were not described in one study that assessed the potential of propensity 
score matching for including demographic variables in equating applications (Paek, Liu, & Oh, 
2006). By comparing results based on propensity score matching to those obtained from the 
poststratification and imputation approaches, the current study can provide bases for evaluating 
the accuracy of equating results based on propensity score matching. 

The next two sections apply the poststratification, imputation, and propensity score 
matching approaches in two examples involving two possible anchors. In both situations, the use 
of two anchors is expected to improve equating. In the first example, tests are to be equated 
across extremely different examinee groups. In the second example, composite tests that include 
multiple-choice and constructed response items and anchors are equated. 

First Example: Equating Across Very Different Groups With Internal and External 
Anchors 

In the following example, the two-anchor poststratification, imputation, and propensity 
score matching approaches are used to produce a conversion for the scores of two forms of a 
formula-scored, multiple-choice mathematics test. The descriptive statistics for the P group’s 
(X, A1, A2) scores and the Q group’s (Y, A1, A2) scores are shown in Tables 3 and 4. A1 is a 
16-item anchor that is internal to test forms X and Y and is the anchor that was intended to be used in 
the actual X-to-Y equating. The importance of using two anchors (A1, A2) is apparent when the 
implications of using only anchor A1 are described. Specifically, A1’s correlations with X and Y 
(0.90) can be interpreted as not quite as large as would be desired to address the fairly large 
standardized mean differences between P and Q (-0.57). Test-anchor correlations that are not as 
large as desired and large standardized mean differences on the anchor suggest that equating 
results based only on the use of A1 could be inaccurate (Livingston, 2004). 



Table 3

First Example: Statistics for Test X and Anchors A1 and A2 in P (N_P = 13,639)

      Min. observed   Max. observed                                        Correlations
      & (possible)    & (possible)    Mean     SD       Skew    Kurtosis   X      A1     A2
X     -5 & (-12)      50 & (50)        20.89    10.48    0.09   -0.65      1.00
A1    -4 & (-4)       16 & (16)         7.47     3.75   -0.06   -0.51      0.90   1.00
A2    200 & (200)     800 & (800)     609.38   101.27   -0.55    0.05      0.84   0.76   1.00




Table 4

First Example: Statistics for Test Y and Anchors A1 and A2 in Q (N_Q = 11,389)

      Min. observed   Max. observed                                        Correlations
      & (possible)    & (possible)    Mean     SD       Skew    Kurtosis   Y      A1     A2
Y     -8 & (-12)      50 & (50)        28.64     9.72   -0.42   -0.15      1.00
A1    -4 & (-4)       16 & (16)         9.52     3.45   -0.36   -0.24      0.90   1.00
A2    200 & (200)     800 & (800)     662.91    83.14   -0.75    1.01      0.80   0.71   1.00



A2 is a second, external anchor: an equated and scaled score on a mathematics test from a 
different testing program. Similar to A1, the P group is of lower ability than the Q group on A2 
(i.e., standardized mean difference = -0.58; Tables 3 and 4). The correlations of A2 with X and Y 
are moderately high (0.84 and 0.80, respectively). The correlations of A2 with A1 are also 
moderately high (0.76 and 0.71), but perhaps not so large as to indicate that the anchors provide 
redundant information about examinee abilities. 
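The standardized mean differences quoted for A1 and A2 can be checked from the means and standard deviations in Tables 3 and 4. The sketch below assumes the difference is divided by the square root of the average of the two groups' variances; the report does not state which pooling formula was used, so this choice (and the function name) is an assumption.

```python
from math import sqrt

def standardized_mean_difference(mean_p, sd_p, mean_q, sd_q):
    """(P mean - Q mean) divided by the root of the average group variance.
    The pooling rule is an assumption; the source does not specify it."""
    pooled_sd = sqrt((sd_p ** 2 + sd_q ** 2) / 2)
    return (mean_p - mean_q) / pooled_sd

# Means and SDs taken from Tables 3 (group P) and 4 (group Q).
smd_a1 = standardized_mean_difference(7.47, 3.75, 9.52, 3.45)
smd_a2 = standardized_mean_difference(609.38, 101.27, 662.91, 83.14)
print(round(smd_a1, 2), round(smd_a2, 2))  # -0.57 -0.58
```

Under this pooling assumption the rounded values agree with the -0.57 and -0.58 reported in the text.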

Multiple regression analyses show that the predictions of test scores X and Y can be 
improved from squared correlations of about 0.82 with only A1 to squared correlations of about 
0.87 with both A1 and A2. These improved squared correlations with the test scores are 
descriptive evidence that using both A1 and A2 will provide a more accurate account of how 
examinee differences affect the X and Y test score differences and will enhance the accuracy of 
the X-to-Y results. The statistical implications of using both A1 and A2 are assessed directly on 
the X-to-Y results. 
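The two-anchor squared correlations can be approximated from the rounded correlations in Tables 3 and 4 using the standard formula for the squared multiple correlation with two predictors. This is a rough check under the assumption that only the published, rounded correlations are available, so small discrepancies from the values computed on the raw data are expected; the function name is hypothetical.

```python
def squared_multiple_correlation(r_t1, r_t2, r_12):
    """R^2 for predicting a test from two anchors, given the two test-anchor
    correlations (r_t1, r_t2) and the anchor-anchor correlation (r_12)."""
    return (r_t1 ** 2 + r_t2 ** 2 - 2 * r_t1 * r_t2 * r_12) / (1 - r_12 ** 2)

# Correlations from Table 3 (test X in P) and Table 4 (test Y in Q).
r2_x = squared_multiple_correlation(0.90, 0.84, 0.76)
r2_y = squared_multiple_correlation(0.90, 0.80, 0.71)
```

With these inputs the function gives roughly 0.87 for X and 0.86 for Y, in line with the "about 0.87" reported for the two-anchor predictions.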

The Use of A1 and A2 in the Equating Process 

The following paragraphs describe the results of using poststratification, imputation, and 
propensity score matching to compute the X-to-Y scaling function using anchors A1 and A2. The 
major steps of equating are presented, including presmoothing, the estimation of the X and Y 
distributions in synthetic population T, and the comparison of linear and curvilinear two-anchor 
functions in terms of scaled score differences and standard errors. The major interest is using the 
poststratification, imputation, and propensity score matching approaches to gauge the importance 
of A2 for the actual scaling results. For this interest, the three approaches’ results based on using 
both A1 and A2 will be compared to those based on using only A1. 

Presmoothing. Loglinear models were used to presmooth the (X, A1, A2) and (Y, A1, A2) 
trivariate distributions. For the poststratification method, the loglinear presmoothing is applied to 
P’s (X, A1, A2) distribution and Q’s (Y, A1, A2) distribution (e.g., von Davier et al., 2004). For 
missing data imputation, the loglinear presmoothing is applied to population T’s trivariate 
(X, A1, A2) and (Y, A1, A2) distributions (Liou & Cheng, 1995; Liou et al., 2001). The loglinear 
models used to presmooth these four trivariate distributions were based on the same 
parameterization, fitting five moments in the marginal test distributions (X or Y), five moments in 
the A1 distributions, six moments in the A2 distributions, and the first and second cross-moments 
of the joint (test, A1), (test, A2), and (A1, A2) distributions and the (X, A1, A2) and (Y, A1, A2) 
distributions. These models were selected because they resembled the models actually used to 
equate these data in practice, and also because evaluations of residuals and model fit indices did 
not reveal obvious model misspecifications. 

For the propensity score matching approach, propensity scores were estimated by 
predicting P and Q group membership for all of the P and Q data, using the logistic regression 
model in Equation 6. These propensity scores were divided into 10 categories, based on the 
predicted probabilities’ deciles. Sensitivity analyses were conducted to compare these 
categorized propensity scores to those produced from alternative logistic regression models and 
categorization schemes. The categorized propensity scores based on model Equation 6 and 
categories defined in terms of deciles were used because they had high correlations with tests X 
and Y (0.92) and because the standardized mean differences between P and Q for A1 and A2 
within each of the 10 categories were smaller than with alternative models and categorizations. 
Bivariate loglinear presmoothing models were used to presmooth P’s (X, CategorizedPropensity) 
and Q’s (Y, CategorizedPropensity) bivariate distributions, fitting five moments in the test 
distributions, five moments in the categorized propensity score distribution, and the first cross- 
moment between the test and categorized propensity scores. 

Test score distribution estimation in synthetic population T. All of the results from the 
presmoothing step were used to estimate the X and Y score distributions in the 
synthetic population, T = wP + (1 - w)Q, w = 0.5. For the poststratification method, this 
estimation was done using Equations 4 and 5 and P and Q’s presmoothed trivariate distributions. 
For missing data imputation, this estimation was done using the trivariate distributions imputed 
for population T in the presmoothing step. For propensity score matching, this estimation was 
done using Equations 2 and 3 and P and Q’s presmoothed bivariate distributions of the tests and 
the categorized propensity scores. 
The descriptive statistics of population T’s X and Y score distributions are shown in 
Table 5. The X and Y score distributions are plotted in Figures 1 and 2. The score distributions 
are essentially identical for the two-anchor poststratification and imputation approaches and are 
somewhat different for the propensity score matching approach. The X and Y means in Table 5 
indicate that X is more difficult than Y by 1.9 or 2.1 points. 

Table 5

First Example: Synthetic Population Distributions for X and Y, Two-Anchor Matching

            X_{P+Q}   X_{P+Q}      X_{P+Q}           Y_{P+Q}   Y_{P+Q}      Y_{P+Q}
            PSE       Imputation   Propensity        PSE       Imputation   Propensity
                                   score matching                           score matching
Mean        23.76     23.76        23.69             25.69     25.69        25.81
SD          10.52     10.51        10.51             10.71     10.71        10.58
Skew        -0.10     -0.10        -0.13             -0.30     -0.30        -0.26
Kurtosis    -0.60     -0.60        -0.61             -0.47     -0.47        -0.47

Note. PSE = poststratification equating. 
Figure 1. First example. Relative frequency distributions of New Form X in Synthetic 
Population T based on the internal and external anchors. 

Figure 2. First example. Relative frequency distributions of Reference Form Y in Synthetic 
Population T based on the internal and external anchors. 

Test score conversion functions and their evaluation. Several test score conversions 
based on the poststratification, imputation, and propensity score matching approaches were 
evaluated. To assess the extent of curvilinearity in the two-anchor conversions, linear and 
curvilinear X-to-Y kernel functions were computed from the three approaches’ X and Y 
distributions estimated in population T. The curvilinear versus linear score differences and the 
+/- 2 standard errors of these equated differences (SEEDs) are plotted in Figures 3 
(poststratification), 4 (imputation), and 5 (propensity score matching). For all three approaches, 
the differences between the curvilinear and linear functions exceed two standard errors 
throughout most of the score range. The largest differences between the curvilinear and linear 
functions occur at the minimum and maximum scores. These differences based on the propensity 
score matching approach are somewhat different from those based on the poststratification and 
imputation approaches. The overall results of Figures 3 through 5 show that, based on all 
approaches, the curvilinear function should be selected rather than the linear function. 
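For orientation, the linear conversion can be illustrated with the standard mean-sigma form of linear observed-score equating, using the poststratification (PSE) means and standard deviations from Table 5. This is only a sketch: the study's linear functions were computed as kernel equating functions, which this closed form approximates but does not reproduce, and the function name is hypothetical.

```python
def linear_equate(x, mean_x, sd_x, mean_y, sd_y):
    """Map an X score to the Y scale by matching means and standard
    deviations in the synthetic population T."""
    return mean_y + (sd_y / sd_x) * (x - mean_x)

# Synthetic-population moments for the PSE approach (Table 5).
MEAN_X, SD_X = 23.76, 10.52
MEAN_Y, SD_Y = 25.69, 10.71

# An X score equal to T's X mean converts to T's Y mean.
y_at_mean = linear_equate(MEAN_X, MEAN_X, SD_X, MEAN_Y, SD_Y)
```

Because the two standard deviations are nearly equal, this linear conversion is close to adding a constant of about 1.9 points across the score range, which is why departures of the curvilinear function from it are most visible at the extremes.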
Figure 3. First example. Curvilinear vs. linear scaled score differences based on two-anchor 
poststratification. 

Figure 4. First example. Curvilinear vs. linear scaled score differences based on two-anchor 
imputation. 

Figure 5. First example. Curvilinear vs. linear scaled score differences based on two-anchor 
propensity score matching. 



A final interest was assessing the implications of using both A1 and A2 relative to using only 
A1 in the X-to-Y conversions. To repeat a concern made when introducing this example, the 0.90 
correlations between A1 and tests X and Y may not be large enough to completely account for the large 
differences between P and Q. While the use of A2 would appear to improve the X-to-Y conversion 
because it improves the correlations between the anchors and the tests, the question is what the impact 
is on the actual X-to-Y conversion. For this assessment, the three approaches’ curvilinear X-to-Y 
functions were computed with only A1 using the previously described presmoothing, test score 
distribution estimation, and equating steps. The two-anchor (A1 and A2) versus one-anchor (A1 only) 
differences for the approaches are plotted in Figures 6 (poststratification), 7 (imputation), and 8 
(propensity score matching). For the poststratification approach, it was possible to compute +/- 2 SEED 
lines for the two-anchor versus one-anchor differences. 



[Figure 6 plot: two-anchor minus one-anchor score differences with +/- 2 SEED lines, plotted against score.]



Figure 6. First example. Curvilinear two-anchor poststratification vs. curvilinear 
one-anchor poststratification. 
Figure 7. First example. Curvilinear two-anchor imputation vs. curvilinear 
one-anchor imputation. 

Figure 8. First example. Curvilinear two-anchor propensity score matching vs. curvilinear 
one-anchor propensity score matching. 

The results show that the two-anchor function is lower than the one-anchor equating 
function for most of the X scores. These differences are about 1 score point at their largest and 
many exceed the +/- 2 SEED lines (Figure 6). The poststratification, imputation, and propensity 
score matching approaches produce similar results in terms of the magnitude of the two-anchor 
versus one-anchor score differences. However, the pattern of score differences based on propensity 
score matching (Figure 8) is visibly different from those of the poststratification (Figure 6) and 
imputation (Figure 7) approaches. 

Second Example: Equating Composite Test Forms With Multiple-Choice and Constructed 
Response Anchors 

The second considered example involves the equating of the composite (multiple-choice 
plus constructed response) forms of a teacher certification exam. There are two anchors 
available, where A1 denotes a 12-item multiple-choice anchor and A2 denotes a sum of six 
2-point constructed response items that is multiplied by 2 when included in the composite. 
Composite form X has a total of 70 points and composite form Y has a total of 72 points. To 
account for human rater drift in the scoring of A2, only a small sample of the available examinee 
data for group Q is used (N_Q = 403) in the equatings: the data from examinees whose A2 
responses were re-scored when group P’s A2 responses were scored. The result is that in P’s 
trivariate (X, A1, A2) distribution, A2 is internal to and contributes to the X score, while in Q’s 
trivariate (Y, A1, A2) distribution, A2 is external to and does not contribute to the Y score. 
Comparisons of the P and Q groups’ test and anchor scores provide somewhat ambiguous 
results (Tables 6 and 7), suggesting that the large group of P examinees is essentially equivalent 
to Q on A1 (i.e., the mean difference between P and Q is 0.02 standardized units) but 
considerably less able than Q on A2 (i.e., the mean difference between P and Q is -0.20 
standardized units). For both groups, A2 is more highly correlated with composite forms X and Y 
than A1 (0.79 and 0.69 vs. 0.59 and 0.60). Tables 6 and 7 show that A1 and A2 are weakly 
correlated with each other (0.28 and 0.30), an expected result for composite forms where 
multiple-choice and constructed response questions likely measure different skills and abilities. 



Table 6

Second Example: Statistics for Test X and Anchors A1 and A2 in P (N_P = 2,875)

      Min. observed   Max. observed                                      Correlations
      & (possible)    & (possible)    Mean    SD      Skew    Kurtosis   X      A1     A2
X     13 & (0)        64 & (70)       43.10   8.06   -0.42     0.22      1.00
A1    1 & (0)         12 & (12)        8.28   2.12   -0.42    -0.08      0.59   1.00
A2    0 & (0)         24 & (24)       12.39   4.47   -0.32    -0.10      0.79   0.28   1.00

Table 7

Second Example: Statistics for Test Y and Anchors A1 and A2 in Q (N_Q = 403)

      Min. observed   Max. observed                                      Correlations
      & (possible)    & (possible)    Mean    SD      Skew    Kurtosis   Y      A1     A2
Y     16 & (0)        64 & (72)       42.66   9.47   -0.45     0.08      1.00
A1    2 & (0)         12 & (12)        8.23   2.18   -0.30    -0.53      0.60   1.00
A2    0 & (0)         24 & (24)       13.35   5.26   -0.56    -0.08      0.69   0.30   1.00



Two questions are particularly important for evaluating the use of two anchors in this 
situation. The first is whether the multiple-choice anchor, A1, can adequately account for 
examinee group differences on the X and Y composite forms. The use of multiple-choice anchors 
to equate composite forms is not the most recommended practice in recent research (Kim, 
Walker, & McHale, 2008) and is not likely to produce strong equatings (see Note 2, p. 31). 
However, in several testing programs, there is interest in using multiple-choice anchors due 
partly to the high costs associated with the use of constructed response anchors. The squared 
correlations from predicting the X and Y scores with A1 are 0.35 (P) and 0.36 (Q). The squared 
correlations from predicting the X and Y scores with both A1 and A2 are 0.78 (P) and 0.64 (Q). 
The substantial increases in squared correlations suggest that using both anchors rather than A1 
will result in a significant improvement in the results. The differences in A1 and A2’s indications 
of P and Q differences (i.e., P and Q are nearly equivalent on A1 but different on A2) may also 
contribute to results that differ when using only A1 rather than using both A1 and A2. The 
question of how the results based on A1 differ from those based on A1 and A2 for the 
poststratification, imputation, and propensity score matching approaches requires evaluation. 

A second question that arises when both multiple-choice and constructed response 
anchors are available is how to make the best use of the two anchors. While the approaches 
described in this study utilize each anchor to the extent that they jointly correlate with test scores, 
in practice the two anchors are usually used as a single summed score. The squared correlations 
from predicting the X and Y scores with the summed anchor are 0.77 (P) and 0.62 (Q). The 
squared correlations from predicting the X and Y scores with the separate anchors are 0.78 (P) 
and 0.64 (Q). From the perspective of correlations and prediction accuracy, there is potential for 
slight improvements in equating from using the two anchors separately rather than in a summed 
form. The implications of the slight improvements in the squared correlations are evaluated in 
direct comparisons of the equating results based on the two anchors and on the single summed 
anchor. 
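
As a rough check on the squared correlations quoted above, they can be approximated from the rounded correlations and standard deviations in Tables 6 and 7 using the standard formulas for one predictor, two predictors, and a summed predictor. The function names below are illustrative and are not part of the original analysis. Because the tabled inputs are rounded to two decimals, the reproduced values should be expected to agree with the reported ones only to within about 0.01.

```python
def r2_one(r1):
    """Squared correlation from predicting the test with A1 alone."""
    return r1 ** 2


def r2_two(r1, r2, r12):
    """Squared multiple correlation from predicting the test with A1 and A2 jointly."""
    return (r1 ** 2 + r2 ** 2 - 2 * r1 * r2 * r12) / (1 - r12 ** 2)


def r2_sum(r1, r2, r12, s1, s2):
    """Squared correlation between the test and the summed anchor A1 + A2."""
    cov_over_sd_test = r1 * s1 + r2 * s2            # cov(test, A1 + A2) / SD(test)
    var_sum = s1 ** 2 + s2 ** 2 + 2 * r12 * s1 * s2  # variance of A1 + A2
    return cov_over_sd_test ** 2 / var_sum


# Rounded values read from Table 6 (group P) and Table 7 (group Q):
# r1, r2 = test-anchor correlations; r12 = A1-A2 correlation; s1, s2 = anchor SDs.
groups = {
    "P": dict(r1=0.59, r2=0.79, r12=0.28, s1=2.12, s2=4.47),
    "Q": dict(r1=0.60, r2=0.69, r12=0.30, s1=2.18, s2=5.26),
}

results = {}
for name, g in groups.items():
    results[name] = (
        r2_one(g["r1"]),
        r2_two(g["r1"], g["r2"], g["r12"]),
        r2_sum(**g),
    )
    print(name, [round(v, 2) for v in results[name]])
```

Each tuple holds the A1-only, two-anchor, and summed-anchor squared correlations for one group, in that order.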

The Use of A1 and A2 in the Equating Process 

The following paragraphs describe the results of using poststratification, imputation, and 
propensity score matching to compute the X-to-Y equating function using anchors A1 and A2. 
The discussion focuses on the major steps of equating, including presmoothing, the estimation of 
the X and Y distributions in synthetic population T, and the comparison of linear and curvilinear 
two-anchor equating functions with respect to equated score differences and standard errors. 
Two additional interests are what results based on the poststratification, imputation, and 
propensity score matching approaches suggest about the importance of A2 for the actual equating 
results, and what the approaches suggest about using a single summed anchor rather than the 
separate use of A1 and A2. 

Presmoothing. Loglinear models were used to presmooth the (X, A1, A2) and (Y, A1, A2) 
trivariate distributions. For the poststratification method, the loglinear presmoothing was applied to 
P’s (X, A1, A2) distribution and Q’s (Y, A1, A2) distribution (e.g., von Davier et al., 2004). For 
missing data imputation, the loglinear presmoothing was applied to population T’s trivariate (X, 
A1, A2) and (Y, A1, A2) distributions (Liou & Cheng, 1995; Liou et al., 2001). The loglinear 
models used to presmooth these four trivariate distributions were based on the same 
parameterization, fitting five moments in the marginal test distributions (X or Y), five moments in 
the A1 distributions, five moments in the A2 distributions, and the first cross-moments of the 
joint (test, A1), (test, A2), and (A1, A2) distributions and the (X, A1, A2) and (Y, A1, A2) 
distributions. The models for P treated A2 as an internal anchor and the models for Q treated A2 
as an external anchor. These models were selected because they resembled the models actually 
used to equate these data in practice, and also because evaluations of residuals and model fit 
indices did not reveal obvious model misspecifications. 
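
The moment-fitting idea behind this presmoothing can be sketched on a much smaller problem. The example below is a simplified stand-in, not the study's model: it fits a bivariate (test, one anchor) loglinear model with two moments per margin and one cross-moment to a made-up frequency table, whereas the study fitted trivariate models with five moments per margin. Fitting by Newton's method with step halving is an assumption of this sketch; the source does not say how its models were estimated.

```python
import math


def solve(A, b):
    """Solve the linear system A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (M[r][n] - sum(M[r][k] * z[k] for k in range(r + 1, n))) / M[r][r]
    return z


def linpred(B, beta):
    """Linear predictor B @ beta, one value per cell of the table."""
    return [sum(b * w for b, w in zip(row, beta)) for row in B]


def loglik(B, counts, beta):
    """Poisson log-likelihood of the cell counts, up to an additive constant."""
    return sum(n * e - math.exp(e) for n, e in zip(counts, linpred(B, beta)))


def fit_loglinear(B, counts, iters=50):
    """Return smoothed cell frequencies from a Poisson loglinear fit (Newton + step halving)."""
    k = len(B[0])
    beta = [math.log(sum(counts) / len(counts))] + [0.0] * (k - 1)  # start from a flat table
    for _ in range(iters):
        m = [math.exp(e) for e in linpred(B, beta)]
        grad = [sum(row[j] * (n - mi) for row, n, mi in zip(B, counts, m)) for j in range(k)]
        hess = [[sum(row[i] * row[j] * mi for row, mi in zip(B, m)) for j in range(k)]
                for i in range(k)]
        step = solve(hess, grad)
        t, base = 1.0, loglik(B, counts, beta)
        # Halve the step only if it clearly lowers the log-likelihood.
        while t > 1e-8 and loglik(B, counts, [w + t * s for w, s in zip(beta, step)]) < base - 1e-9:
            t /= 2.0
        beta = [w + t * s for w, s in zip(beta, step)]
    return [math.exp(e) for e in linpred(B, beta)]


# Made-up (X, A1) frequency table: rows are X = 0..4, columns are A1 = 0..2.
table = [[6, 3, 1], [9, 10, 3], [7, 16, 9], [3, 12, 12], [1, 4, 8]]
cells = [(x, a) for x in range(5) for a in range(3)]
counts = [table[x][a] for x, a in cells]
# Design columns: intercept, X, X^2, A1, A1^2, X*A1.
B = [[1.0, x, x * x, a, a * a, x * a] for x, a in cells]
smoothed = fit_loglinear(B, counts)
```

At convergence the smoothed table reproduces the observed total and the observed moments that correspond to the design columns, which is the property that makes "fitting five moments" a meaningful description of a loglinear presmoothing model.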

For the propensity score matching approach, propensity scores were estimated by 
predicting P and Q group membership for all of the P and Q data, using logistic regression. 
Several logistic regression models similar to Equation 6 were considered, including those that 
used linear and quadratic functions of A1 and A2 and A1A2. Several categorization schemes were 
considered for the propensity scores produced from the logistic regression models. The logistic 
regression model and propensity score categorization that produced categorized propensity 
scores with the highest correlations with the tests (0.67) and also produced the smallest 
standardized mean differences between P and Q for A1 and A2 within the categories were 
selected. The propensity scores were obtained by predicting membership in P and Q for all of the 
P and Q group data with the following logistic regression model: 

Propensity(P | A1, A2) = [right-hand side of the selected logistic regression model is not legible in the source] 

These propensity scores were divided into 10 categories based on the predicted probabilities’ 
deciles. Bivariate loglinear presmoothing models were used to presmooth 
P’s (X, CategorizedPropensity) and Q’s (Y, CategorizedPropensity) bivariate distributions, 
fitting five moments in the test distributions, five moments in the categorized propensity score 
distribution, and the first cross-moment between the test and categorized propensity scores. 
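
The categorization step can be sketched as follows. The logistic coefficients and the examinee data are hypothetical placeholders, not the study's estimates (the selected model's terms are not given here), and the inclusion of an A1*A2 term is likewise only an assumption for illustration. The sketch shows how predicted probabilities are cut at their deciles to give 10 ordered categories.

```python
import bisect
import math
import random
import statistics


def propensity(a1, a2, coef):
    """Logistic-model probability of membership in P given the two anchor scores."""
    b0, b1, b2, b12 = coef
    eta = b0 + b1 * a1 + b2 * a2 + b12 * a1 * a2
    return 1.0 / (1.0 + math.exp(-eta))


random.seed(0)
# Hypothetical examinees: A1 scored 0-12 and A2 scored 0-24, as in the second example.
anchors = [(random.randint(0, 12), random.randint(0, 24)) for _ in range(500)]
coef = (0.8, 0.05, -0.06, 0.001)  # hypothetical values, not fitted to any real data

scores = [propensity(a1, a2, coef) for a1, a2 in anchors]
cuts = statistics.quantiles(scores, n=10)                    # nine decile cut points
categories = [bisect.bisect_right(cuts, s) for s in scores]  # category labels 0..9
```

Because anchor scores are integers, many examinees share the same predicted probability, so the 10 categories need not contain exactly equal numbers of examinees.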

Test score distribution estimation in synthetic population T. The results of the 
presmoothing step were used to estimate the X and Y score distributions in synthetic 
population T = wP + (1 - w)Q, with w = 0.5. For the poststratification method, this 
estimation was done using Equations 4 and 5 and P and Q’s presmoothed trivariate distributions. 
For missing data imputation, this estimation was done using the trivariate distributions imputed 
for population T in the presmoothing step. For propensity score matching, this estimation was 
done using Equations 2 and 3 and P and Q’s presmoothed bivariate distributions of the tests and 
categorized propensity scores. 
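
A minimal sketch of the poststratification estimate is given below, assuming that Equations 4 and 5 take the usual poststratification form: the conditional distribution of X given (A1, A2) observed in P is assumed to hold in T and is weighted by T's joint anchor distribution, w times P's plus (1 - w) times Q's. The function name, data layout, and toy probabilities are illustrative assumptions.

```python
from collections import defaultdict


def synthetic_x_distribution(joint_p, anchor_q, w=0.5):
    """Estimate X's distribution in T = wP + (1 - w)Q by poststratifying on (A1, A2).

    joint_p  : {(x, a1, a2): probability} in group P, summing to 1
    anchor_q : {(a1, a2): probability} in group Q, summing to 1
    """
    anchor_p = defaultdict(float)
    for (x, a1, a2), p in joint_p.items():
        anchor_p[(a1, a2)] += p                      # P's joint anchor distribution
    f_t = defaultdict(float)
    for (x, a1, a2), p in joint_p.items():
        cond = p / anchor_p[(a1, a2)]                # P(X = x | a1, a2) in P
        h_t = w * anchor_p[(a1, a2)] + (1 - w) * anchor_q.get((a1, a2), 0.0)
        f_t[x] += cond * h_t
    return dict(f_t)


# Toy distributions with X, A1, and A2 each scored 0 or 1.
joint_p = {
    (0, 0, 0): 0.20, (1, 0, 0): 0.05, (0, 0, 1): 0.10, (1, 0, 1): 0.15,
    (0, 1, 0): 0.10, (1, 1, 0): 0.10, (0, 1, 1): 0.05, (1, 1, 1): 0.25,
}
anchor_q = {(0, 0): 0.10, (0, 1): 0.20, (1, 0): 0.30, (1, 1): 0.40}

f_t = synthetic_x_distribution(joint_p, anchor_q, w=0.5)
```

The estimate sums to 1 only when every anchor combination with positive probability in Q also has positive probability in P, which presmoothed distributions satisfy. The Y distribution would be obtained in the same way with the roles of P and Q reversed.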

The descriptive statistics of population T’s X and Y score distributions are shown in 
Table 8. The score distributions are plotted in Figures 9 and 10. The score distributions are very 
similar for the two-anchor poststratification and imputation approaches. The X and Y score 
distributions based on propensity score matching differ from those of the poststratification and 
imputation approaches, particularly in their standard deviations. The X and Y means in Table 8 
indicate that X is easier than Y. 

Table 8 

Second Example: Synthetic Population Distributions for X and Y, Two-Anchor Matching 

Figure 9. Second example. Relative frequency distributions of New Form X in Synthetic 
Population T based on the multiple-choice (MC) and constructed-response (CR) anchor 
scores. 

Conversion functions and their evaluation. In evaluating the equating results from the 
poststratification, imputation, and propensity score matching approaches, three comparisons 
were of interest. The first comparison assesses the curvilinearity of the approaches’ two-anchor 
equating functions. The second comparison addresses the question of whether the multiple-choice 
anchor, A1, can adequately account for examinee group differences on the composite 
forms with both multiple-choice and constructed response items. The third comparison considers 
whether the approaches’ two-anchor results differ from their results when a summed anchor, 
A1 + A2, is used. 



Figure 10. Second example. Relative frequency distributions of Reference Form Y in 
Synthetic Population T based on the multiple-choice (MC) and constructed-response (CR) 
anchor scores. [Plotted series: poststratification, imputation, propensity score matching.] 

To evaluate the curvilinearity of the approaches’ two-anchor equating functions, linear 
and curvilinear two-anchor equating functions were computed using the poststratification, 
imputation, and propensity score matching approaches. The differences between these equating 
functions are plotted in Figures 11 (poststratification), 12 (imputation), and 13 (propensity score 
matching). For poststratification and imputation, the curvilinear equating function’s differences 
from the linear equating function are within two SEEDs throughout the score range. For 
propensity score matching, the differences between the curvilinear and linear equating functions 
are small and within two SEEDs for all but the lowest scores. The overall results of Figures 11 
through 13 show that, based on all three approaches, the linear equating function should be 
selected rather than the curvilinear equating function. 
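
The comparison logic above can be sketched in a few lines. The linear function below is the textbook mean-and-standard-deviation form, used here as a stand-in; the study's own linear and curvilinear functions are kernel-based and may differ in detail. The toy distributions, differences, and SEED values are made up for illustration.

```python
import math


def moments(dist):
    """Mean and standard deviation of a score distribution {score: probability}."""
    mean = sum(s * p for s, p in dist.items())
    var = sum((s - mean) ** 2 * p for s, p in dist.items())
    return mean, math.sqrt(var)


def linear_equate(x, dist_x, dist_y):
    """Linear X-to-Y conversion that matches the means and SDs of X and Y in T."""
    mean_x, sd_x = moments(dist_x)
    mean_y, sd_y = moments(dist_y)
    return mean_y + sd_y / sd_x * (x - mean_x)


def outside_band(diffs, seeds, k=2.0):
    """Scores at which the equated-score difference exceeds k SEEDs in absolute value."""
    return [x for x in diffs if abs(diffs[x]) > k * seeds[x]]


# Toy synthetic-population distributions for X and Y.
dist_x = {0: 0.2, 1: 0.5, 2: 0.3}
dist_y = {0: 0.1, 1: 0.4, 2: 0.5}
equated_mean = linear_equate(1.1, dist_x, dist_y)   # the X mean maps to the Y mean

# Toy curvilinear-minus-linear differences and their SEEDs at two score points.
flagged = outside_band({0: 0.1, 1: -0.5}, {0: 0.2, 1: 0.2})
```

When `outside_band` returns few or no score points, the curvilinear function does not differ from the linear one by more than sampling variability, which is the basis for preferring the simpler linear function.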



Figure 11. Second example. Curvilinear versus linear equated score differences based on 
two-anchor poststratification. [Plotted series: curvilinear minus linear differences and +/- 2 SEED bands.] 

Figure 12. Second example. Curvilinear versus linear equated score differences based on 
two-anchor imputation. 

For the question of whether the multiple-choice anchor A1 can adequately account for 
examinee group differences on the X and Y composite forms, linear equating functions based 
on using the two anchors, A1 and A2, were computed and compared to linear scaling 
functions based on only A1 for the poststratification, imputation, and propensity score 
matching approaches. The differences between these functions are plotted in Figures 14 
(poststratification), 15 (imputation), and 16 (propensity score matching). The results are very 
similar for the poststratification and imputation approaches, showing that the two-anchor 
equating function is higher than the one-anchor scaling function for X scores below 30, and 
lower than the one-anchor scaling function for scores between 30 and 70. These differences 
are about 3 score points, at their largest, and exceed the +/- 2 SEED lines for scores above 40 
(Figure 14). For the propensity score matching approach, the two-anchor equating function is 
lower than the one-anchor scaling function for the whole score range by about 1.5 points 
(Figure 16). The somewhat different results produced from the propensity score matching 
approach correspond to differences in the estimates of population T’s X and Y standard 
deviations when based on propensity score matching rather than on the poststratification and 
imputation approaches (Table 8). 



Figure 13. Second example. Curvilinear versus linear equated score differences based on 
two-anchor propensity score matching. [Plotted series: curvilinear minus linear differences and +/- 2 SEED bands.] 

Figure 14. Second example. Linear two-anchor poststratification versus linear one-anchor 
poststratification. 

Figure 15. Second example. Linear two-anchor imputation versus linear one-anchor 
imputation. 

Figure 16. Second example. Linear two-anchor propensity score matching versus linear 
one-anchor propensity score matching. 

For the question of whether the use of A1 and A2 in a summed form produces equating 
results that differ from the results obtained from using A1 and A2 jointly, linear equating functions 
were computed using the sum of A1 and A2 as an anchor, and these equating functions were 
compared to linear equating functions computed using A1 and A2 jointly. The differences 
between these equating functions are plotted in Figures 17 (poststratification), 18 (imputation), and 
19 (propensity score matching). The results show that, for the poststratification and imputation 
approaches, the two-anchor equating function is almost identical to the summed anchor equating 
function for the lower range of the X scores and is slightly higher than the summed anchor equating 
function for the upper range of the X scores. Although the differences are small (between one third 
and one half of 1 score point), many exceed the +/- 2 SEED lines (Figure 17). 

Figure 17. Second example. Linear two-anchor poststratification versus linear summed 
anchor poststratification. 

Figure 18. Second example. Linear two-anchor imputation versus linear summed anchor 
imputation. 

Figure 19. Second example. Linear two-anchor propensity score matching versus linear 
summed anchor propensity score matching. 

For the propensity score matching approach (Figure 19), the two-anchor equating 
function is lower than the summed anchor equating function for X scores below 48 and is higher 
than the summed anchor equating function for X scores above 48. As in the results of comparing 
equating functions based on only A1 to those based on A1 and A2, the difference between the 
propensity score matching results and the poststratification and imputation results corresponds to 
the differences in the approaches’ estimates of T’s X and Y standard deviations (Table 8). The 
slightly different results between the two-anchor and summed anchor equating functions based 
on the poststratification and imputation approaches are more consistent with the slightly different 
squared correlations of the summed anchors and tests (0.77 and 0.62) and the joint anchors and 
tests (0.78 and 0.64) than the larger differences indicated in propensity score matching. 

Discussion 

The traditional NEAT design’s incorporation of a single anchor to address examinee 
group differences on test scores has been a long-standing concern in equating research (Angoff, 
1984; Kolen & Brennan, 2004; Livingston et al., 1990). This concern appears to be most serious 
when NEAT equating is conducted across examinee groups that are extremely different in ability 
and/or when the tests measure content that is likely to be broader than the content measured by 
the anchor. Various proposals have been made for using more than one anchor to account for 
examinee group differences. Three approaches to incorporating two anchors in equating have 
been proposed but not extensively studied or compared: poststratification, missing data 
imputation, and propensity score matching (Angoff, 1984; Liou & Cheng, 1995; Liou et al., 
2001; Livingston et al., 1990). This paper described how the approaches could be used to 
implement the assumptions and equating of the poststratification method. The three approaches 
were demonstrated in two situations where the use of two anchors would appear to be warranted. 

The results of this study’s applications showed that the poststratification, imputation, and 
propensity score matching approaches could all be used in similar ways to incorporate two 
anchors and compute equating and scaling functions. The poststratification and imputation 
methods produced results that were essentially identical for both examples of this study, a 
finding that was not emphasized in prior evaluations of imputation applications in equating (Liou 
& Cheng, 1995; Liou et al., 2001). Propensity score matching produced results that were similar 
to the results of the other approaches for the first example, but somewhat different results for the 
second example. The next section describes the issues involved in the three approaches in more 
detail. 

Two-Anchor Equating Methods 

As shown in this paper, the missing data imputation approach is a more limited version of 
the poststratification approach. Imputation uses the same loglinear presmoothing methods as 
used in poststratification equating, but incorporates the poststratification assumptions directly 
into the presmoothing to impute test and anchor distributions for synthetic population T. The 
result is a more complex equating process that produces results that are similar to those of 
poststratification equating. One difficulty with imputation is that the standard errors and SEEDs 
tend to be inaccurate due to difficulties with incorporating sample sizes into standard error 
formulas that reflect both the complete and the imputed data. The unresolved question is how to 
simultaneously account for complete data on A1 and A2, but incomplete data on X and Y. 

The application of propensity score matching to two-anchor equating requires logistic 
regression models and categorizations that add ambiguity and probably inaccuracy to its results. 
For this study’s second example, where these modeling and categorization decisions were 
complex due to each anchor giving inconsistent information about the examinee groups, the 
propensity score matching produced X and Y distributions on T with standard deviations that 
were different from those of the poststratification and imputation approaches (Table 8). The 
resulting linear functions based on propensity score matching had considerably different slopes 
from those of the other methods, and the differences between equating functions reflected these 
different slopes (Figures 16 and 19). Some follow-up simulations showed that the standard errors 
and SEEDs based on propensity score matching were inaccurate due to uncertainty in how to 
incorporate the influences of the categorization decisions in the equating variability estimates. 
Approaches other than this study’s use of categorized propensity scores would have likely 
produced more closely matching examinees from the P and Q groups (Rosenbaum & Rubin, 
1985; Rubin & Thomas, 1996; Rubin & Thomas, 2000), but these approaches involve discarding 
data and matching the individuals of one group to those in the other group rather than estimating 
score distributions for a synthetic mixture of the P and Q groups’ data. 

The problems with propensity score matching prompt the question of whether it should 
be used at all in equating and scaling situations. While the results of this study discourage the use 
of propensity score matching for situations involving one or two anchors, propensity score 
matching could still be valuable when there are more than two possible anchors. For example, a 
situation could be encountered where there are several available anchors, including examinees’ 
pass/fail decisions, grade point averages, scaled item response theory (IRT) thetas, and more than 
one internal anchor. For this situation, the loglinear presmoothing models used by the 
poststratification and imputation approaches would be exceedingly large and much more 
unwieldy than the logistic regression modeling used by propensity score matching. 

The overall results of this study suggest that when using two anchors, the 
poststratification approach works better than the imputation and propensity score matching 
approaches. Poststratification is the most flexible approach in terms of the SEEDs that can be 
produced for evaluating competing equating and scaling functions. Some follow-up simulations 
have shown that the standard errors and SEEDs for poststratification functions are more accurate 
than those of the imputation and propensity score matching approaches. To the extent one-anchor 
poststratification equating functions are biased due to test-anchor correlations that are too small 
(Livingston, 2004), two-anchor poststratification improves accuracy because the two anchors 
likely have larger correlations with the tests than one anchor. A question for further study is 
whether approaches can be developed for incorporating multiple anchors outside of the 
poststratification framework. 

Other Two-Anchor Possibilities 

In the situation of NEAT equating with one anchor, the chained equating approach may 
be more accurate than poststratification because it ignores the test-anchor correlations 
(Livingston, 2004; Livingston et al., 1990). The incorporation of two anchors into chained 
equating is less straightforward than for poststratification equating. Chained equating is based on 
the marginal distributions of the tests and anchors involved, so that there are several possible 
chained equating functions based on each anchor that can potentially be used. If multiple anchors 
are used in chained equating, there is a different chained equating function corresponding to each 
possible order of the anchors in the equating chain. Reasonable ways to convert two anchors into 
a summed, single anchor for chained equating exist in some situations (this study’s second 
example) but not in others (this study’s first example). Perhaps the way to use multiple anchors 
that have no obvious way of being summed in chained equating is through converting them into 
a single propensity score. 
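
The point that each anchor defines its own chained function can be illustrated with a chained linear sketch. This is an illustration only and is not an analysis reported in the study: it applies the textbook chained linear form (X linked to the anchor in P, then the anchor linked to Y in Q) to the rounded means and standard deviations in Tables 6 and 7, and the function names are assumptions.

```python
def linear_link(score, mean_from, sd_from, mean_to, sd_to):
    """Link a score from one scale to another by matching mean and SD."""
    return mean_to + sd_to / sd_from * (score - mean_from)


def chained_linear(x, x_p, anchor_p, anchor_q, y_q):
    """X -> anchor within P, then anchor -> Y within Q; each argument is (mean, SD)."""
    a = linear_link(x, *x_p, *anchor_p)
    return linear_link(a, *anchor_q, *y_q)


# Rounded (mean, SD) values from Table 6 (group P) and Table 7 (group Q).
x_p, y_q = (43.10, 8.06), (42.66, 9.47)
a1_p, a1_q = (8.28, 2.12), (8.23, 2.18)
a2_p, a2_q = (12.39, 4.47), (13.35, 5.26)

# Convert the X mean through each anchor separately.
via_a1 = chained_linear(43.10, x_p, a1_p, a1_q, y_q)
via_a2 = chained_linear(43.10, x_p, a2_p, a2_q, y_q)
print(round(via_a1, 2), round(via_a2, 2))
```

The two chains give different conversions of the same X score because P and Q are nearly equivalent on A1 but differ on A2, so a chained approach has to choose among, or somehow combine, the anchor-specific functions.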







Another alternative to poststratification equating might be to poststratify on the joint 
distribution of one or both of the anchors’ expected true scores rather than on the joint 
distribution of their observed scores. This strategy would be analogous to the Levine observed 
equating approach. Some work has been done to conduct loglinear presmoothing on observed 
and expected true score distributions (Chen & Holland, in preparation). Such an approach could 
expand on the flexibility of this study’s two-anchor poststratification approach. In particular, 
some potential anchors may be considered as directly measuring what the tests measure, while 
others are predictors of examinee group membership but not necessarily congeneric with the 
tests (Wright & Dorans, 1993). The ability to presmooth the expected true score distributions of 
the first type of anchors and presmooth the observed score distributions of the second type of 
anchors could result in the most appropriate treatment of multiple anchors in NEAT equating. 

Terminology Implications 

An important question in this paper’s discussion of the first and second examples was 
whether the two-anchor results produced for the first example and the one-anchor results 
produced for the second example constituted equatings or whether they were more appropriately 
referred to as scalings. Although producing interchangeable scores was an equating goal that was common 
to both examples, scaling and equating labels were used to distinguish results (see Notes 1 and 2, 
p. 30). These labels were used primarily based on equating practitioners’ judgments about how 
representative the anchors were of the tests involved. The equating literature’s syntheses and 
studies are not completely clear on what label is most appropriate for a score conversion that 
utilizes one or more anchors that represent the tests to varying degrees (Holland & Dorans, 2006; 
Dorans, Liu, & Hammond, 2008; Wright & Dorans, 1993). A need for future discussions of 
equating and scaling terminology is clarification and criteria for what types of anchors are likely 
to produce scalings and equatings. 

Notes 



1 Practitioners from the testing program where this first example’s data came from have used 
internal anchors for some test score conversions and external anchors for others. These 
practitioners refer to the test score conversions produced from internal anchors as equatings. 
Conversions based on external anchors are referred to as scalings rather than equatings 
because the external anchor is not completely representative of the tests. Though the general 
goal of this first example is to produce an equating function for two tests, the two-anchor 
results produced from using both the internal and external anchors will be referred to as 
scalings rather than as equatings. 

2 Equating practitioners often regard conversions of composite scores using anchors composed of 
multiple-choice and constructed response items as equatings and conversions using anchors 
composed of only multiple-choice items as scalings. Though the general goal of this second 
example is to produce an equating for two composite forms, the results produced from using 
both multiple-choice and constructed response items in the anchor will be referred to as 
equatings and the results produced from using only multiple-choice items in the anchor will 
be referred to as scalings. 

3 In the equating sample, Q’s A2 responses on the reference form were rescored by the same pool 
of readers who scored P’s A2 responses on the new form. Therefore, for Q, A2 is external to 
total score Y because the Y score came from the operational scoring at the time when the reference 
form was administered; the part score on A2 that contributed to Y was given by another set of 
readers. 

Appendix 

SEEDs for Two-Anchor Equating and Scaling Functions 

This appendix provides an overview of the standard errors of equating (or scaling) 
differences (SEEDs) that are used to evaluate the two-anchor functions. The SEED has the 
general form (von Davier, Holland, & Thayer, 2004) 

    SEED_Y(x) = \lVert J_{e_1} J_{DF_1} C - J_{e_2} J_{DF_2} C \rVert    (A1) 

where, for each of the two functions being compared, the J_DF C terms denote the transformation 
of the loglinear presmoothed distributions into the X and Y score probabilities in synthetic 
population T, and the J_e terms denote vectors of the derivatives of the equating (or scaling) 
functions with respect to the X and Y score probabilities in synthetic population T. Different 
SEEDs can be calculated using Equation A1. 

Approaches’ Loglinear Presmoothing Models and Their C terms 

The poststratification, imputation, and propensity score matching approaches all rely on 
loglinear models for their presmoothing. For the poststratification and imputation approaches, 
the loglinear models used are trivariate models of the (X, A1, A2) and (Y, A1, A2) distributions in 
the data provided by P and Q (poststratification) or directly in the synthetic T distribution 
(imputation). To illustrate, the loglinear presmoothing model of the (X, A1, A2) distribution has 
the following form, 

    \log_e(p_{jlm}) = \alpha + \sum_{i=1}^{I} \beta_{xi} (x_j)^i + \sum_{h=1}^{H} \beta_{a1h} (a1_l)^h + \sum_{g=1}^{G} \beta_{a2g} (a2_m)^g
                      + \sum_{f=0}^{F} \sum_{e=0}^{E} \sum_{d=0}^{D} \beta_{fed} (x_j)^f (a1_l)^e (a2_m)^d    (A2) 

where p_{jlm} is the probability of the score combination (x_j, a1_l, a2_m) (i.e., score x_j on test X, 
score a1_l on test A1, and score a2_m on test A2) and \alpha and the \beta’s are free parameters 
estimated in the model-fitting process. The propensity score matching approach also uses a 
loglinear model for its presmoothing, but on the bivariate distribution of the tests and the 
categorized propensity score. To illustrate, the loglinear presmoothing model of the 
(X, CatPropensity(P | A1, A2)) distribution has the following form, 

    \log_e(p_{jl}) = \alpha + \sum_{i=1}^{I} \beta_{xi} (x_j)^i + \sum_{h=1}^{H} \beta_{ah} (CatPropensity(P | A1, A2)_l)^h
                     + \sum_{g=1}^{G} \sum_{f=1}^{F} \beta_{gf} (x_j)^g (CatPropensity(P | A1, A2)_l)^f    (A3) 

The C matrices for loglinear models such as Equations A2 and A3 have the same number of 
columns as the parameters in the loglinear models and can be computed as described in Holland 
and Thayer (2000). 



Approaches’ J_DF Terms 

The J_DF matrices are based on how the C matrices from the poststratification, 
imputation, and propensity score matching approaches are transformed into X and Y probability 
distributions for synthetic population T. The transformations are conceptually described in 
Equations 2 through 4 in this paper and are described in more detail in von Davier et al. (2004). 
The one-anchor J_DF that is directly used in propensity score matching is specifically described 
in von Davier et al. The two-anchor J_DF that is used in two-anchor poststratification extends the 
computations of the one-anchor J_DF by using the joint probabilities of the two anchors, A1 and 
A2, rather than the univariate probabilities of the single anchor, A. The imputation J_DF is based 
on the single group design described in von Davier et al. because imputation produces X and Y 
distributions for a single group, T, in its loglinear presmoothing step. 



J_e Terms 

The J_e terms are the derivatives of the equating (or scaling) functions with respect to the 
X and Y score probabilities. Throughout this paper, the J_e terms pertain to linear and curvilinear 
kernel functions and are similarly computed for the poststratification, imputation, and propensity 
score matching approaches. The details for computing J_e are in von Davier et al. (2004). 

SEEDs and Their J_e J_DF C Terms 

SEEDs can be calculated as in Equation A1 based on the two functions’ J_e J_DF C terms. 
The poststratification, imputation, and propensity score matching approaches can all be used to 
compute SEEDs to compare linear and curvilinear functions that differ only in their J_e values. 
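
The calculation described here can be sketched as a Euclidean norm of the difference between the two functions' J_e J_DF C row vectors at a given score x. This reading of Equation A1 is an assumption based on the surrounding description and on the general approach of von Davier et al. (2004); the function names and the tiny matrices below are hypothetical and serve only to show the shapes involved.

```python
import math


def row_times(row, M):
    """Multiply a row vector by a matrix stored as a list of rows."""
    return [sum(row[i] * M[i][j] for i in range(len(row))) for j in range(len(M[0]))]


def seed(je_1, jdf_1, je_2, jdf_2, C):
    """Norm of the difference between two functions' J_e J_DF C terms at one score x."""
    v1 = row_times(row_times(je_1, jdf_1), C)
    v2 = row_times(row_times(je_2, jdf_2), C)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))


# Hypothetical 2 x 2 inputs: identity J_DF and C, so only the J_e vectors matter.
identity = [[1.0, 0.0], [0.0, 1.0]]
same = seed([0.3, 0.7], identity, [0.3, 0.7], identity, identity)   # identical functions
diff = seed([1.0, 0.0], identity, [0.0, 1.0], identity, identity)   # different J_e vectors
```

For a linear versus curvilinear comparison within one approach, `jdf_1` and `jdf_2` would be the same matrix and only the J_e vectors would differ, as the paragraph above notes.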

The poststratification approach is somewhat more general than imputation and propensity 
score matching, so that SEEDs for additional function comparisons can be computed. Because 
the loglinear presmoothing models and the estimation of T’s distributions are done in separate 
steps in poststratification equating, it is possible to add additional conversions of the loglinear 
presmoothing results. Two such conversions of interest involve transforming the trivariate 
loglinear presmoothed distributions into bivariate distributions, by either aggregating the 
presmoothed results over one anchor or over the sum of the scores on the two anchors. When 
these conversions are applied to the loglinear presmoothed results and its C matrices and are 
used to compute one-anchor results, SEEDs for evaluating two-anchor versus one-anchor or two- 
anchor versus a summed anchor functions can be computed. 
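
The two aggregations mentioned above can be sketched directly on a presmoothed trivariate distribution. The dictionary layout, function name, and toy probabilities are illustrative assumptions; the sketch covers only the distributions, not the matching transformation of the C matrices.

```python
from collections import defaultdict


def collapse(joint, anchor_of):
    """Aggregate {(x, a1, a2): p} into a bivariate {(x, anchor): p} distribution.

    anchor_of maps an (a1, a2) pair to the single anchor score that is kept.
    """
    out = defaultdict(float)
    for (x, a1, a2), p in joint.items():
        out[(x, anchor_of(a1, a2))] += p
    return dict(out)


# Toy presmoothed (X, A1, A2) distribution with each score equal to 0 or 1.
joint = {
    (0, 0, 0): 0.20, (1, 0, 0): 0.05, (0, 0, 1): 0.10, (1, 0, 1): 0.15,
    (0, 1, 0): 0.10, (1, 1, 0): 0.10, (0, 1, 1): 0.05, (1, 1, 1): 0.25,
}

x_a1 = collapse(joint, lambda a1, a2: a1)          # aggregate over A2: (X, A1)
x_sum = collapse(joint, lambda a1, a2: a1 + a2)    # summed anchor: (X, A1 + A2)
```

Both aggregations are sums of cell probabilities, that is, linear maps of the trivariate distribution, which is why the same conversions can be carried through to the C matrices when computing SEEDs.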







